Abstract

We present a new R package that takes a numerical matrix as input and
computes clusters with a support vector clustering (SVC) method. We have
implemented an original 2-D grid labeling approach to speed up cluster
extraction. In this sense, SVC is an efficient cluster extraction method
when clusters are separable on a 2-D map. We also show that this SVC
approach, used with a Jaccard-radial base kernel, classifies a set of terms
into ontological classes well enough to help define regular-expression rules
for information extraction from documents; our case study concerns a set of
terms and documents about developmental and molecular biology.

Keywords: unsupervised learning, support vector clustering, lexical clustering,
pattern discovery, grid-based labeling, ontology, terminology, Jaccard-radial
kernel


1 Introduction 

Mining text archives is a great challenge because many documents are available
and their number grows with the capacity of computer storage. Building domain
rules for knowledge extraction requires efficient features with low semantic
ambiguity. This is not an easy task; we address it by representing linguistic
expressions (i.e. terms) as feature vectors and by clustering the terms with a
scalable density-based distance.

The first idea concerns the choice of a density-based method and the
improvement of its scalability. Clustering can be a useful knowledge-poor
technique to induce organization in scattered data [50]. Non-parametric
methods such as support vector machines are well suited to analyzing noisy
data through density processing. An unsupervised support vector algorithm [?]
has been proposed that encloses data clusters with contours and is based on a
radial kernel. Diverse applications have been tested: novelty detection [?],
rule extraction [33], deoxyribonucleic acid (DNA) and chemical compounds [?],
and image processing [5]. Assigning points to contours and their related
clusters relies on testing points sampled between each pair of data points,
which is time-consuming. Some studies [32], [34] have proposed ways to speed
up the method. In particular, [43] and [26] proposed an improved cluster
labeling method, i.e. assigning points to clusters by graph analysis.

We present a new robust method based on computing a hash function over
surrounding points, using a grid that we map to the data with a
k-nearest-neighbor method. We developed this clustering method on the R
platform [40] as a package called svcR, and we compared our approach with
others, especially graph-based ones, on the Iris dataset (svcR is available
from the Comprehensive R Archive Network at
http://CRAN.R-project.org/package=svcR). [22] have also developed a support
vector method for clustering, but in a divisive way that iteratively searches
for a classical separating hyperplane with a classical support vector machine.
Its first step tries to separate the data set from an artificial set built in
the same space of attribute values and with the same size as the data set,
which is not theoretically justified. When few classes are present (2 or 3)
and few attributes describe the data, the algorithm seems to find groups; in
other cases it tends to return a number of final clusters equal to the number
of iterations.

The second idea presented in this paper concerns our original use of the
support vector clustering (SVC) methodology cited above to resolve a certain
form of ambiguity in natural language. Information retrieval [?] and
information extraction [24], [39] are key methodologies for retrieving
information from text archives. But simple keywords may have several senses,
so assigning terms to conceptual classes is important [?]. Clustering may be
used to reduce the number of variables that rules for information mining in
documents must take into account. We base our assumption on two lines of work:
first, that collocation analysis is useful for understanding morphological
structure and its link to a conceptual space [?]; second, that clustering is a
good approach to building semantic classes with the help of a similarity
distance [33], [?]. Clustering linguistic terms in this way can reveal common
features from which to build rules for information mining in document
archives. Classifying a set of terms requires representing the data as vectors
of morphological information (whole terms or parts of terms, and in what
form?) and determining which kernel to use for SVC. We use whole terms,
morphological primitives and bigrams as morphological information. We use the
Levenshtein distance, the Jaccard similarity index, the radial basis function,
and the combined Levenshtein-radial and Jaccard-radial kernels to study the
clustering effect.

In Section 2 we introduce the methodology of support vector clustering.
Section 3 presents the labeling approach and Section 4 studies vector
representations and different kernels for term clustering. Finally, Section 5
evaluates the technique.


2 Support vector clustering 

In this section, we recall the clustering approach.

2.1 Kernel trick

Suppose we know a priori the classes of items (for instance, red circles and
yellow squares) and we search for a linear frontier in a higher-dimensional
space. To that end, the data are transformed using a kernel function (a dot
product). The data are preprocessed with the mapping:

$$ \Phi : X \to H, \qquad x \mapsto \Phi(x) \qquad (1) $$

where $K$ is a dot product of the space $H$ (a Hilbert space), and we learn
the mapping from $\Phi(x)$ to $y$ (the class).

    Learning data $x$ in $X$ are multivariate data on $X^d$,
    where $d$ is the number of features.                          (2)

$$ \langle \Phi(x), \Phi(x') \rangle = K(x, x') \text{ can be computed in } X^d \qquad (3) $$

($\|\cdot\|$ is the norm associated with the dot product $K$.)

Usually $\dim(X) \ll \dim(H)$.
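As an illustration of Equation 3, the sketch below computes a Gaussian (radial) kernel matrix directly in the input space, without ever forming $\Phi(x)$. It assumes the usual form $K(x, x') = e^{-q \|x - x'\|^2}$ with width parameter $q$; the helper name `gaussian_kernel` is hypothetical and is not a function of svcR.

```r
# Gaussian (radial) kernel matrix: K[i, j] = exp(-q * ||x_i - x_j||^2).
# `gaussian_kernel` is a hypothetical helper, not part of svcR.
gaussian_kernel <- function(X, q) {
  X <- as.matrix(X)
  d2 <- as.matrix(dist(X))^2   # squared Euclidean distances between rows
  exp(-q * d2)
}

X <- as.matrix(iris[, 1:4])
K <- gaussian_kernel(X, q = 0.5)
dim(K)             # one row and one column per data point
all(diag(K) == 1)  # K(x, x) = 1: every point has unit norm in feature space
```

Only pairwise distances in the 4-dimensional input space are needed, although the implied feature space $H$ is of much higher dimension.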

2.2 Optimization 

Viewed in the extreme, modelling the distribution of data in unsupervised
learning can be interpreted as density estimation. In our case, however, the
approach estimates quantiles of a distribution, not its density. In SVC, we
determine support vectors that delimit the distribution of points. The goal is
then to find the minimal sphere that encloses the data. One can show that if
$\Phi(x_1), \ldots, \Phi(x_N)$ are separable from the origin in $H$, then the
solution of the margin optimization problem between two classes corresponds to
the normal vector of the hyperplane separating the data from the origin with
maximum margin.

In our case we try to enclose the data in a ball. The points inside the ball
represent the data to classify (the first class) and the origin represents the
second class. The primal problem is written as follows. Let $a$ be the
(non-fixed) center of the ball, $R$ the radius of the ball and $C$ a fixed
penalty constant controlling the number of data points near the ball. We
minimize:

$$ F(R, \{\xi_i\}_{i=1}^{N}) = R^2 + C \sum_i \xi_i \qquad (4) $$

under the constraints:

$$ \|\Phi(x_i) - a\|^2 \le R^2 + \xi_i, \quad \text{where } \xi_i \ge 0 \text{ for all } i = 1, \ldots, N \qquad (5) $$

The dual problem (for a convex problem) is obtained from the Lagrangian
$L(R, a, \{\xi_i\}_{i=1}^{N}, \{\beta_i\}_{i=1}^{N}, \{\mu_i\}_{i=1}^{N})$,
where $\beta_i \ge 0$ and $\mu_i \ge 0$ are Lagrange multipliers; see [?] for
details about the computation of $\beta_i$. The multipliers are used to
compute the center $a$ and $\Phi(x)$.

If $x$ is a support vector, the radius is:

$$ R^2 = \|\Phi(x) - a\|^2 \qquad (6) $$

For any point $y$, the distance $R_y$ from the center is:

$$ R_y^2 = \|\Phi(y) - a\|^2 \qquad (7) $$

Hence it is possible to test whether $y$ is inside the sphere by comparing $R$
and $R_y$.
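The inside/outside test can be sketched in R as follows. The sketch assumes the standard expansion of the center as $a = \sum_j \beta_j \Phi(x_j)$, which gives $R_y^2 = K(y,y) - 2 \sum_j \beta_j K(x_j, y) + \sum_{i,j} \beta_i \beta_j K(x_i, x_j)$; this expansion is not written out in the text above, and the Gaussian kernel, the function name `sq_radius` and the toy multipliers `beta` are assumptions made for illustration (real multipliers come from the dual optimization).

```r
# Squared feature-space distance of the rows of Y to the ball center
# a = sum_j beta_j * Phi(x_j), for a Gaussian kernel exp(-q * ||x - y||^2).
# All names here are illustrative, not svcR functions.
sq_radius <- function(Y, X, beta, q) {
  X <- as.matrix(X); Y <- as.matrix(Y)
  kern <- function(A, B) {
    # cross squared distances via ||a||^2 + ||b||^2 - 2 a.b
    d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
    exp(-q * pmax(d2, 0))
  }
  const <- as.numeric(t(beta) %*% kern(X, X) %*% beta)
  1 - 2 * as.numeric(kern(Y, X) %*% beta) + const   # K(y, y) = 1 here
}

X <- rbind(c(0, 0), c(2, 0))
beta <- c(0.5, 0.5)                       # toy multipliers summing to 1
r2 <- sq_radius(rbind(X, c(10, 10)), X, beta, q = 1)
inside <- r2 <= r2[1] + 1e-12             # take x_1 as a support vector: R^2 = r2[1]
inside                                    # the distant point (10, 10) falls outside
```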

2.3 Contour deformation 

The parameter $\nu = 1/(NC)$ asymptotically represents an upper bound on the
rate of bounded support vectors (BSV). The parameter $C$ takes values in
$[1/N, 1]$; as $C$ is reduced, more and more points are labeled as outliers.
If $q$ increases, the Gaussian radius decreases and the number of support
vectors (SV) increases. Subsequently, if one or more points of a cluster
become support vectors, a specific contour is generated for that cluster.
Beyond a certain value of $q$, support vectors appear around each cluster.

3 Geometric hashing based labeling 

In this section, we describe our mapping methodology to assign data points to 
clusters. 

3.1 2-d grid assumption 

In the method above, only support vectors guide the construction of contours;
the method does not tell whether a given point lies inside or outside a
contour. Some methods, such as those described in the foundational work by [?]
and [?], work with an adjacency matrix defined as follows. Given two data
points $x_i$ and $x_j$ and $R$ (the radius of the ball), the adjacency matrix
$A$ is such that:

$$ A_{ij} = \begin{cases} 1 & \text{if every point } x_k \text{ between } x_i \text{ and } x_j \text{ is such that } R(x_k) \le R, \\ 0 & \text{otherwise.} \end{cases} \qquad (8) $$

Hence [?] define a set of points between each pair, test whether all of them
belong to the sphere, and accordingly assign the pair of points to a cluster.
In the second method, [43] use a graph method to analyze the dense areas of
the graph defined by the adjacency matrix. We compared our approach with these
two, which we call nearest-neighbors (NN) and minimum spanning tree (MST)
respectively in the following. These methods are time-consuming, so we devised
a method based on a geometric hashing function implemented with a grid
surrounding the data points in the attribute space. Basically, following the
SVC method, we compute the radius only for the grid points (which are the hash
keys) to build clusters, and like [5] we assume that nearby points can be
associated with the same hash. We use a nearest-neighbor method [7] to
associate data points with their hash.
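To see why the adjacency-based labeling is costly, the following sketch implements the pairwise segment test of Equation 8. It is a minimal illustration, not the cited authors' code: `radius_fn` (a function returning $R(\cdot)$ for each row of a matrix) and `n_samples` are assumptions, and the toy `radius_fn` used in the example is a simple input-space stand-in for the feature-space radius.

```r
# Adjacency test of Equation (8): sample points on the segment between
# x_i and x_j and require every sample to fall inside the ball.
# `radius_fn` and `n_samples` are assumptions of this sketch.
adjacency_matrix <- function(X, radius_fn, R, n_samples = 10) {
  X <- as.matrix(X)
  n <- nrow(X)
  A <- diag(1, n)
  t <- seq(0, 1, length.out = n_samples)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      seg <- outer(1 - t, X[i, ]) + outer(t, X[j, ])   # samples on the segment
      A[i, j] <- A[j, i] <- as.numeric(all(radius_fn(seg) <= R))
    }
  }
  A
}

X <- rbind(c(0, 0), c(1, 0), c(10, 0))
# Toy stand-in for R(.): distance to the nearest data point in input space.
toy_radius <- function(P) {
  apply(P, 1, function(p) min(sqrt(colSums((t(X) - p)^2))))
}
A <- adjacency_matrix(X, toy_radius, R = 0.6)
A   # points 1 and 2 are adjacent; point 3 is isolated
```

The double loop evaluates the radius at `n_samples` points for each of the N(N-1)/2 pairs, which is the cost the grid-based labeling avoids.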
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3.2 Algorithm 

The basic idea behind random projections is a class of hash functions that are
locality-sensitive; that is, if two points $a$ and $b$ are close, $|a - b|$ is
small and they hash to the same value with high probability. If they are
distant, they collide with small probability. We use the following
definitions. Let $M$ be the size of the grid, fixed by the user. A
2-dimensional $M \times M$ grid is characterised by a step $s$. The step $s$
is defined from the minimal and maximal values of the first two coordinates
obtained by correspondence analysis (COA): $c1$ is the first coordinate and
$c2$ the second. We use the ade4 package of the R project to compute the
COA [?]. Let $g_i$ be a grid point.


Definition 3.1 ($s_{ck}$)

Let $s_{ck}$ be the scale of the grid along the correspondence coordinate $ck$:

$$ s_{ck} = \frac{\max\{g_i^{ck}\}_{i=1}^{n} - \min\{g_i^{ck}\}_{i=1}^{n}}{M}, \quad ck \in \{c1, c2\} \qquad (9) $$


We can define the set of grid points $g_G$, with each point $g_i$, by:


Definition 3.2 ($g_G$)

Let $g_G$ be the set defined by:

$$ g_G(s_{ck}) = \{ g_i : \dim(g_i) = 2 \text{ and } (g_i^{ck} - g_{i-1}^{ck}) = s_{ck},\ i \in [1:G] \}, \quad ck = \{c1, c2\} \qquad (10) $$

For each point $g_i$ we can assess membership of a cluster without specifying
which one.
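A minimal sketch of the grid construction follows. It makes several assumptions that are not in the text: the helper `make_grid` is hypothetical, grid points are placed at cell centers, and the first two principal components from base R `prcomp` stand in for the COA coordinates that the package obtains from ade4.

```r
# Build a G x G grid over the first two map coordinates (c1, c2) of the data.
# `make_grid` is a hypothetical helper; grid points are placed at cell centers.
make_grid <- function(coords, G) {
  rng <- apply(coords, 2, range)            # rows: min, max; columns: c1, c2
  step <- (rng[2, ] - rng[1, ]) / G         # step s_ck of Equation (9), per axis
  gx <- rng[1, 1] + step[1] * (seq_len(G) - 0.5)
  gy <- rng[1, 2] + step[2] * (seq_len(G) - 0.5)
  list(step = step, points = expand.grid(c1 = gx, c2 = gy))
}

# Stand-in for the COA map: first two principal components of iris.
coords <- prcomp(iris[, 1:4])$x[, 1:2]
grid <- make_grid(coords, G = 5)
nrow(grid$points)   # G * G grid points
grid$step           # one step per coordinate
```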


Definition 3.3 Let $C = \{c_j\}$ be the set of clusters. Knowing the radius
$R$ according to Equation 7:

$$ g_i \in C \text{ if } (R - R(g_i)) \ge 0 \qquad (11) $$

We now define the set of clusters from grid points:

Definition 3.4 ($C$)

We call $C$ the set of clusters. A cluster consists of a grid point and all
its neighbouring grid points:

$$ C = \{ c_j : \exists\, g_i \in C \,\wedge\, g_k \in c_j \text{ if } g_k^{c1} \in [g_i^{c1} - 1, g_i^{c1} + 1],\ g_k^{c2} \in [g_i^{c2} - 1, g_i^{c2} + 1] \} \qquad (12) $$

Now we define the ball as the neighbourhood of the hash key $(X, Y)$, to which
a specific cluster reference $c_j$ is assigned using a k-nearest-neighbour
threshold:

Definition 3.5 ($B_k(X, Y)$)

Let $g_G$ be the grid, $C$ the set of clusters and $P(P_{c1}, P_{c2})$ a point
with coordinates $(X, Y)$ in the grid space $G \times G$. Then the ball
$B_k(X, Y)$ of $P$ is defined by at least $k$ neighbours belonging to the same
cluster:

$$ B_k(X, Y) = \{ c_j : g_i \in c_j \,\wedge\, g_i^{c1} \in [X - 1, X + 1],\ g_i^{c2} \in [Y - 1, Y + 1] \,\wedge\, \#\{g_i\} \ge k \} \qquad (13) $$

A family $H = \{h : F \to G\}$ is called locality-sensitive if, for any point
$a$, the function $p(u)$ defined as follows decreases in $u$:

Definition 3.6 ($p(u)$)

$$ p(u) = \Pr_H[\, h(a) = h(b) = (X, Y) : |a - b| \le u,\ E(a_{c1}) = E(b_{c1}) = X,\ E(a_{c2}) = E(b_{c2}) = Y \,] \qquad (14) $$

That is, the probability of a collision between points $a$ and $b$ decreases
with the distance between them.

After a grid has been defined on the data space, the ClusterLabeling function
carries out the first stage, assigning a cluster number to each point of the
grid. The computation of the Lagrange coefficients gives the kernel matrix
(MK). The user sets the grid size G, and the MinMax values in the data space
can then be computed. The main function (findSvcModel, described in the next
section) outputs a matrix called NumPoints linking each grid point to a
cluster id. The radius Rc can be computed according to the algorithm shown in
Table 1.

Algorithm 1

Require: kernel matrix MK, grid size G, MinMaxX (min and max of the x values
    in the data), MinMaxY (min and max of the y values in the data).

Ensure: NumPoints, a GxG vector giving, for each grid point, its membership to
    a cluster id.

for each GxG grid point P do {we identify if a point belongs to a possible cluster}
    Associate x, y values to P from MinMaxX and MinMaxY
    Calculate radius Rp of P; if Rp <= Rc, give ball membership to P
end for

for each GxG grid point P(i) with ball membership do {we identify cluster id(s)}
    if no point P(k) within one step of P(i) has a cluster membership then
        Create a new cluster vector CV with a new cluster id Cm
        Put CV in a list of cluster vector memberships LCV
        Put P(i) in CV and associate Cm with P(i) in NumPoints
    else
        Associate the cluster id of P(k) with P(i) in NumPoints
    end if
end for

for each CV(i) in LCV do {we merge close clusters}
    for each other CV(j) in LCV, CV(j) != CV(i) do
        if CV(i) is within one step of CV(j) then
            Merge CV(i) and CV(j)
            Update NumPoints
        end if
    end for
end for


Table 1: Calculate Rc radius of the ball using MK.
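The labeling passes of Algorithm 1 can be sketched in R as a connected-component search on the grid. This is a variant, not the package's ClusterLabeling code: it takes as input a logical matrix `inside` (TRUE where Rp <= Rc, assumed to be already computed) and uses a flood fill over the 8 neighbours, which yields the same regions as creating provisional clusters and then merging those one step apart. The name `label_grid` is hypothetical.

```r
# Label connected regions of grid points that lie inside the ball.
# `inside` is a G x G logical matrix; label 0 means "no cluster".
label_grid <- function(inside) {
  G1 <- nrow(inside); G2 <- ncol(inside)
  labels <- matrix(0L, G1, G2)
  next_id <- 0L
  for (i in seq_len(G1)) for (j in seq_len(G2)) {
    if (!inside[i, j] || labels[i, j] != 0L) next
    next_id <- next_id + 1L                 # start a new cluster id
    labels[i, j] <- next_id
    stack <- list(c(i, j))
    while (length(stack) > 0) {
      p <- stack[[length(stack)]]
      stack[[length(stack)]] <- NULL        # pop the last grid point
      for (di in -1:1) for (dj in -1:1) {   # visit neighbours one step away
        a <- p[1] + di; b <- p[2] + dj
        if (a >= 1 && a <= G1 && b >= 1 && b <= G2 &&
            inside[a, b] && labels[a, b] == 0L) {
          labels[a, b] <- next_id
          stack[[length(stack) + 1]] <- c(a, b)
        }
      }
    }
  }
  labels
}

inside <- matrix(FALSE, 5, 5)
inside[1:2, 1:2] <- TRUE
inside[4:5, 4:5] <- TRUE
lab <- label_grid(inside)
lab   # two separate blocks receive ids 1 and 2
```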

Finally we can assign a cluster label to any point $x$ of the data set
according to the hash function and the corresponding ball value, defined in
Equation 13:

$$ f(x) = c_j \text{ if } B_k(h(x)) = c_j \qquad (15) $$

The MatchGridPoint function, presented below, carries out the second stage:
the computation of $f(x)$ in Equation 15. It returns a vector called
ClassPoints associating a cluster id with each data point of the initial
dataset (see Table 2).

Algorithm 2

Require: data matrix MD, grid size G, MinMaxX, MinMaxY, NumPoints,
    neighbourhood k of a data point.

Ensure: ClassPoints, a vector giving, for each data point, its membership to a
    cluster id.

for each point D(i) in MD do
    Calculate the grid coordinates of D(i) with MinMaxX, MinMaxY
end for
for each point D(i) in MD do
    Init a score vector SV(i) with one entry per cluster id
    for each grid point P(j) in NumPoints do
        if P(j) has a cluster id c and the distance between P(j) and D(i) is within k then
            Increment SV(i)(c)
        end if
    end for
    Associate the cluster id with the maximal score in SV(i) to ClassPoints(i)
end for


Table 2: MatchGridPoint routine.
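The second stage can be sketched as a vote among nearby labelled grid points. The sketch below is a hypothetical stand-in for MatchGridPoint, with one simplifying assumption: instead of counting grid points within a distance threshold, it takes the k nearest labelled grid points and returns their majority cluster id.

```r
# Assign each data point the majority cluster id among its k nearest
# labelled grid points. `match_grid_points` is an illustrative name.
match_grid_points <- function(coords, grid_xy, grid_labels, k = 1) {
  keep <- grid_labels > 0                   # ignore grid points outside clusters
  stopifnot(any(keep))
  gxy <- as.matrix(grid_xy)[keep, , drop = FALSE]
  glab <- grid_labels[keep]
  apply(as.matrix(coords), 1, function(p) {
    d2 <- colSums((t(gxy) - p)^2)           # squared distances to grid points
    nearest <- order(d2)[seq_len(min(k, length(d2)))]
    votes <- table(glab[nearest])
    as.integer(names(votes)[which.max(votes)])
  })
}

grid_xy <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6), c(2.5, 3))
grid_labels <- c(1, 1, 2, 2, 0)             # the last grid point is unlabelled
coords <- rbind(c(0.2, 0.4), c(5.1, 5.4), c(4, 4))
class_points <- match_grid_points(coords, grid_xy, grid_labels, k = 1)
class_points
```

With a small k the assignment follows the closest labelled cell, which matches the observation in the text that peripheral points become hard to assign when the grid is fine and the nearest labelled cell is far away.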

3.3 Usage of the svcR package 

The main function is findSvcModel. It computes a clustering model and returns
it as an R object that other functions can use for display and export. Calling
the returned object ret, it carries information about the model parameters,
such as the Lagrange coefficients (getlagrangeCoeff(ret)$A attribute), the
kernel matrix (getMatrixK(ret) attribute) and the cluster memberships
(getClassPoints(ret) attribute). findSvcModel takes 10 arguments:

• data.frame, the data: a data.frame in standard use,
  or a data.frame in loadMat use,
  or DatMat in Eval use, a matrix given as the single argument

• MetOpt, optimization control parameter: optimStoch (stochastic
  optimization) or optimQuad (quadratic optimization)

• MetLab, labelling method: gridLabeling (grid labelling), mstLabeling
  (MST labelling) or knnLabeling (k-NN labelling)

• KernChoice, kernel choice: KernLinear (Euclidean), KernGaussian (RBF),
  KernGaussianDist (exponential) or KernDist (matrix data as kernel value)

• Nu, nu parameter

• q, q parameter

• k, number of nearest neighbours for the grid

• G, grid size

• Cx, x component to display (1 for 1st attribute)

• Cy, y component to display (2 for 2nd attribute)

If Cx and Cy are 0, correspondence analysis is used. The data are given as the
first argument, in data.frame() format (i.e. a list), like the well-known iris
dataset. Some R libraries are required: quadprog [2] for optimization, and
ade4 [?] and spdep [5] for principal component analysis. Here is an example of
usage in R:


R> library("svcR")
R> data("iris")
R> retA <- findSvcModel(iris, MetOpt = "optimStoch", MetLab = "gridLabeling",
+       KernChoice = "KernGaussian", Nu = 0.5, q = 40, K = 1, G = 5,
+       Cx = 0, Cy = 0)
R> plot(retA)
R> ExportClusters(retA, "iris")
R> findSvcModel.summary(retA)

This call takes the iris data frame as data. The kernel choice is
radial-based, and the parameters of the SVC technique are nu = 0.5 and q = 40.
The parameters for cluster labeling are k = 1 neighbor and a grid size of
5 x 5 points. Cx = Cy = 0 means that the first two principal components are
used. The MetLab value means that the geometric-hashing method is used. The
plot function visualizes the clusters. ExportClusters writes the clusters to a
file with the variable names. findSvcModel.summary displays the size and
number of clusters, and the averaged attributes for each cluster.

Some functions help the user navigate the clusters. ShowClusters(retA) returns
all clusters ordered by id (cluster 0 is a bag of variables that could not be
clustered), GetClustersTerm(retA, term = "121") returns the clusters
containing a member whose name includes "121" as a substring, and
GetClusterID(retA, Id = 1) returns the cluster with Id = 1.

3.4 Toy example 

We used the famous Fisher's Iris data set. It contains 3 classes, 150 data
points and 4 attributes. Our cluster extraction relies largely on the topology
of points located on a 2-D map. The dimensions of the map are found by
correspondence analysis, of which we keep the first two coordinates. On this
projection of the Iris data, classes 2 and 3 are not well separated, as shown
in Figure 1. The method therefore captures class 1 well, but from time to time
a "bridge" appears between classes 2 and 3 that links them into one cluster
(Figure 1).

The system is not robust enough to enforce such a weak topological boundary,
so several runs may be needed before cluster 2 and cluster 3 appear
separately. For a grid size of G = 13, we obtain 50% success after a certain
number of runs.

The nearest-neighbour parameter k is used to find the closest cluster for a
given data point. Low values such as k = 1 or k = 2 give the same level of
precision when obtaining 3 clusters. But this approach is not sufficient for a
good level of precision when the grid size is large (G > 25), because
peripheral data points are then too far from their cluster. We compare the
precision of our approach ("GRID") with two other variants based on the
construction of an adjacency matrix. The first variant builds the adjacency
matrix with a minimum spanning tree ("MST-adj"); the second uses k-nearest
neighbours ("KNN-adj"). All three approaches have an order parameter that we
call k: the number of nearest neighbours for GRID and KNN-adj, and the number
of links of a node in the tree for MST-adj. Mainly two clusters are captured
by all approaches, and precision is computed as the number of points of the
majority class included in each cluster divided by the total number of points
(150). For GRID, precision is stable when k is small (between 1 and 3) and
competitive with the other approaches (Figure 2).

Figure 1: 2-D displays showing the data, the clustered grid, and the data
superimposed on the clustered grid. Top left: data plotted with COA c1 = 1,
c2 = 2; top middle: G = 11, #unclassified points is 17, #misclassified points
is 9; top right: G = 13, #unclassified points is 2, #misclassified points is
7; bottom: G = 30.

Figure 2: Clustering precision on the iris dataset with parameters Nu = 0.7,
q = 1200, G = 13. K is the number of neighbours for the GRID or KNN-adjacency
approach, or the number of links for the MST-adjacency approach.
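The precision measure used in this comparison can be sketched as follows. The function name `cluster_precision` is hypothetical, and counting points left in cluster 0 (unclassified) as errors is an assumption of this sketch rather than something the text specifies.

```r
# Precision: for each cluster, count the points of its majority class,
# sum over clusters and divide by the total number of points.
# Points in cluster 0 (unclassified) count as errors in this sketch.
cluster_precision <- function(truth, cluster) {
  tab <- table(cluster, truth)
  tab <- tab[rownames(tab) != "0", , drop = FALSE]
  sum(apply(tab, 1, max)) / length(truth)
}

truth   <- rep(c("a", "b", "c"), each = 4)
cluster <- c(1, 1, 1, 1,  2, 2, 2, 0,  2, 2, 2, 2)
p <- cluster_precision(truth, cluster)
p   # 8 of 12 points sit in the majority class of their cluster
```

The toy labels mimic the situation described above, where one class is captured cleanly and two classes are merged into a single cluster.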

A second labelling stage using high-distance nearest neighbours should perform
well at this grid size. But as Figure 3 (bottom) shows, the running time of
svcR becomes less attractive when G > 30; when G is between 5 and 25, the run
time can increase by 25%. We generally choose G in this range, and performance
does not suffer compared with the other approaches. Figure 3 (top) shows that
MST-adj is faster than KNN-adj, and that with 150 points svcR is 2.05 times
faster. Even with G = 25 we divide the time by 1.65, hence we save about 40%
of the time. In summary, for G = 13 and a data size N < 50 the run time is
almost the same for every method, but it increases very fast for the NN method
as the data size increases. Our approach becomes attractive for much larger
amounts of data. For the whole Iris data set our approach is two times faster,
but the run time depends on the grid size. Figure 3 shows that for G > 25 it
becomes less competitive. With G = 150 the run time is ten times that with
G = 40, and twenty times that with G = 20. We used the quadprog package of the
R project for optimization.


4 Representation of term sets and kernels 

In this section, we describe a suitable representation for classifying terms
from a specific domain with an adapted kernel.

4.1 Data, language models and domain knowledge 

In the previous section we showed that a radial base function can produce a
suitable clustering. But the data were described by only a few attributes and
did not come from natural language, with its ambiguities of sense. We
therefore attempted to classify terms from a specific domain: molecular and
developmental biology.

Our linguistic data set consists of 1,893 terms (linguistic phrases) manually
extracted from an annotation of 1,471 documents (5,730 sentences), in which
the annotated linguistic phrases describe temporal stages of biological
development. The corpus itself was built manually from the Medline document
database, on the topics of spore coat formation and gene transcription,
specifically for the species Bacillus subtilis. We now define some notions of
the language models studied in the rest of this section. Consider the phrase
"septal localisation of the division", which we take to be a term. From this
term we can consider different substructures. "septal" and "localisation" are
distinct words; "sept" is a radical, i.e. a sequence of characters that can be
found in other words; "septal localisation" is a bigram, i.e. a sequence of
two words; and "localisation of the" is a trigram, i.e. a sequence of three
words.
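The substructures of a term can be extracted with a few lines of base R. This is only an illustrative sketch; `word_ngrams` is a hypothetical helper, and the radicals in the study were selected manually rather than computed.

```r
# Split a term into word n-grams (n = 1: words, 2: bigrams, 3: trigrams).
# `word_ngrams` is an illustrative helper, not part of svcR.
word_ngrams <- function(term, n) {
  words <- strsplit(term, " +")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

term <- "septal localisation of the division"
word_ngrams(term, 1)                 # the five words
word_ngrams(term, 2)                 # bigrams, starting with "septal localisation"
word_ngrams(term, 3)                 # trigrams, including "localisation of the"
grepl("sept", term, fixed = TRUE)    # the radical "sept" occurs in the term
```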

The textual corpus we used describes biological knowledge, in particular a
well-known biological model called sporulation. This biological process is
activated by a microorganism to resist starvation in its environment: the
bacterium is transformed into a resistant sphere with minimum needs and
activity. In information extraction from texts, gene network reconstruction is
an interesting field for understanding how a gene network is activated.
Temporal and spatial information are complementary and help to understand when
gene interactions occurred. A well-studied biological process such as
sporulation can serve as a reference model for two reasons:

• Enough molecular information about gene-gene interactions has been gathered
in texts over the last ten years;

• It is a well-described biological model across different stages.

Six main stages describe the sporulation process. At the beginning of the
process a frontier called the septum is created, and at the last stage an
engulfment is created to release the bacterial spore. The 1,893 terms have
also been classified manually into the 6 biological stages. On average, about
600 terms cover a given stage. The problem is related to the morphological and
fuzzy description of language. Where a strict formal description should be
used, for instance "stage II" for the second stage of biological development,
an expert could use "during the first stages of sporulation", "at the onset of
sporulation", "at stage I-II", "after septum formation", etc. A further
complexity of the description, which follows from having 600 terms per class
out of only 1,893, is that many terms are not exclusive to one stage (i.e. one
class). Many expressions can designate a stage, and often several stages at
the same time.

Figure 3: Running time of svcR. Top: normalized speed of our Grid approach
compared with MST-adj and KNN-adj as the data size grows from 1 point to 150
points (svcR parameters: Nu = 0.7, q = 1200, G = 13). Bottom: normalized speed
as the grid size parameter of svcR increases from 5 to 40 points (number of
points = 150, Nu = 0.7, q = 1200).

Figure 4: Samples of terminological data sets.

Why could a clustering method such as SVC be of interest? We observed that:

1. Most terms describing the occurrence of a gene
activation/inhibition/regulation are not expressed in a simple regular way
such as "at stage 2" or "at stage 3"; the terminology of temporal knowledge
has variable expressivity;

2. Many terms are not exclusive to a stage.

In such a context of use, the svcR technique could help an expert build rules
about expressions, establishing an equivalence between a set of expressions
and mapping each rule to a specific class. We decided to compare which
language model benefits term description and, for each language model, which
kernel is more relevant. We manually selected a list of simple morphological
radicals (11 tokens), word bigrams (a restricted sample of 500 out of 1,477)
and word trigrams (a restricted sample of 500 out of 2,179) from the whole set
of terms. Figure 4 gives a sample of some linguistic expressions. In our
clustering experiments we first made a sample of 98 terms and 4 classes,
similar in size to the iris data (Section 3.4).

After determining which language model (term-radical, term-term, term-bigram,
term-trigram) and which kernel are efficient enough, we apply that language
model and kernel to the whole set of 1,893 terms.

4.2 Kernels 

Because terms (which are strings), taken intrinsically and without textual
context, can be compared statistically either pairwise (the Levenshtein way)
or as bags of words (the Jaccard way), we compared these approaches, and
additionally tested robustness with randomized non-null values in the Jaccard
case. The Levenshtein distance is an edit distance based on the cost of
transforming one string into another [?]. Let $a$ and $b$ be two strings. Let
$a_i$ be the substring consisting of the first $i$ symbols of string $a$,
where $0 \le i \le \|a\|$, and $b_j$ the substring consisting of the first $j$
symbols of $b$. Iteratively we obtain the Levenshtein distance at positions
$i$ and $j$:

$$ D_{i,j} = \min(D_{i-1,j} + w_I,\ D_{i-1,j-1} + w_S,\ D_{i,j-1} + w_D) \qquad (16) $$

where $w_I$, $w_D$ and $w_S$ are the weights of the insertion, deletion and
substitution operations respectively, and $D_{0,0} = 0$. Finally $D_{a,b}$
represents the weighted Levenshtein distance. From its expression we define
the Levenshtein radial base kernel:

$$ LRB(x_1, x_2) = e^{-q \|x_1 - x_2\|^2} \qquad (17) $$

We also define a kernel using only the Levenshtein distance between a pair of
terms:

$$ RBL(x_1, x_2) = e^{-q \|D_{12}\|^2} \qquad (18) $$
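The recurrence for the Levenshtein distance can be sketched by dynamic programming as below. Two points are assumptions of the sketch: the substitution weight is charged only when the two symbols differ (the usual edit-distance convention), and the radial kernel on rows of the distance matrix takes the Gaussian form $e^{-q\|x_i - x_j\|^2}$. The function name `levenshtein` is illustrative; base R's `utils::adist` is used only as a cross-check.

```r
# Weighted Levenshtein distance by dynamic programming.
# The substitution weight applies only when the symbols differ.
levenshtein <- function(a, b, w_ins = 1, w_del = 1, w_sub = 1) {
  x <- strsplit(a, "")[[1]]; y <- strsplit(b, "")[[1]]
  D <- matrix(0, length(x) + 1, length(y) + 1)
  D[, 1] <- seq(0, length(x)) * w_del     # delete a whole prefix of a
  D[1, ] <- seq(0, length(y)) * w_ins     # insert a whole prefix of b
  for (i in seq_along(x)) for (j in seq_along(y)) {
    D[i + 1, j + 1] <- min(D[i, j + 1] + w_del,
                           D[i + 1, j] + w_ins,
                           D[i, j] + w_sub * (x[i] != y[j]))
  }
  D[length(x) + 1, length(y) + 1]
}

levenshtein("septum", "septal")                  # two substitutions
as.numeric(utils::adist("septum", "septal"))     # cross-check with base R

# Levenshtein radial base kernel: row i of the distance matrix is the vector x_i.
terms <- c("septum", "septal", "spore")
Dm <- outer(terms, terms, Vectorize(levenshtein))
LRB <- exp(-0.1 * as.matrix(dist(Dm))^2)         # q = 0.1 chosen arbitrarily
```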

Equation 17 and Equation 18 are compositions with a positive semi-definite
kernel (the radial base function), so the final kernels are also positive
semi-definite. The Jaccard index is a similarity index [19] that is useful for
assessing the similarity between two objects from the sets of their own
attributes only, rather than from the whole set of attributes, which is often
huge and mostly does not describe the given objects. Its expression is the
following, where string $s_1$ is composed of tokens $s_{10}, \ldots, s_{1m}$
and string $s_2$ of tokens $s_{20}, \ldots, s_{2k}$:

$$ J_{12} = J(s_1, s_2) = \frac{|\{s_{10}, \ldots, s_{1m}\} \cap \{s_{20}, \ldots, s_{2k}\}|}{|\{s_{10}, \ldots, s_{1m}\} \cup \{s_{20}, \ldots, s_{2k}\}|} \qquad (19) $$
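As a small worked example of the Jaccard index, the sketch below treats each term as the set of its word tokens. Tokenizing on spaces is an assumption made for illustration; the study also uses radicals, bigrams and trigrams as tokens.

```r
# Jaccard index between two terms viewed as sets of word tokens.
jaccard <- function(s1, s2) {
  t1 <- unique(strsplit(s1, " +")[[1]])
  t2 <- unique(strsplit(s2, " +")[[1]])
  length(intersect(t1, t2)) / length(union(t1, t2))
}

# Shared tokens: localisation, of, the (3); all distinct tokens: 6.
j <- jaccard("septal localisation of the division", "localisation of the septum")
j
```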


Hence we define a Jaccard-radial base kernel (JRB) on the vectors formed by
the Jaccard indices of each term with the other terms (the data matrix is
symmetric):

$$ JRB(x_1, x_2) = e^{-q \|x_1 - x_2\|^2} \qquad (20) $$

We also define a kernel using only the Jaccard index between a pair of terms:

$$ RBJ(x_1, x_2) = e^{-q \|J_{12}\|^2} \qquad (21) $$
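A Jaccard-radial base kernel matrix can be sketched as follows: each term is described by its vector of Jaccard indices with all terms, and a Gaussian kernel is applied to these vectors. The Gaussian form $e^{-q\|x_i - x_j\|^2}$, the value of q, the word-level tokenization and the toy terms are all assumptions of this sketch.

```r
# Jaccard-radial base kernel sketch: row i of the Jaccard matrix J is the
# vector x_i, and the kernel is a Gaussian on these vectors.
jaccard <- function(s1, s2) {
  t1 <- unique(strsplit(s1, " +")[[1]])
  t2 <- unique(strsplit(s2, " +")[[1]])
  length(intersect(t1, t2)) / length(union(t1, t2))
}

terms <- c("septum formation", "septum", "spore coat formation", "spore coat")
J <- outer(terms, terms, Vectorize(jaccard))   # symmetric term-term data matrix
q <- 0.5                                       # arbitrary width for illustration
JRB <- exp(-q * as.matrix(dist(J))^2)
round(JRB, 3)
```

Terms that share the same profile of similarities to the other terms get a kernel value close to 1, even if their direct Jaccard index is modest.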


Equation 20 and Equation 21 are compositions with a positive semi-definite
kernel (the radial base function), so the final kernels are also positive
semi-definite. [32] and [3] adapted a kernel approach with a Levenshtein and a
Jaccard similarity coefficient respectively, and proved their robustness
despite their simplicity. In our data representation we used four kernels: the
Levenshtein-radial base (LRB), the radial base-Levenshtein (RBL), the
Jaccard-radial base (JRB) and the radial base-Jaccard (RBJ). We also
introduced noise into the data matrix: whenever the Jaccard coefficient is 0,
we assign a random non-null value to the corresponding component of the data
matrix. We call this fuzzy variant Jaccard+. The vector approach using such
distance and index heuristics in natural language processing describes items
by sets of words, but the property defining such sets can vary: for instance,
co-frequency in textual context (with left and right collocations;
lexical-based similarity), string inclusion between two terms
(dictionary-based similarity), or ontological nodes shared between two terms
(conceptual-based similarity). We focused on the second and compared several
dictionaries for building the similarity matrix. As variables to classify we
of course used the set of terms, and as attributes the set of radicals (RD),
the set of terms itself (TM), the set of bigrams (BG) and the set of trigrams
(TG).
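The Jaccard+ variant can be sketched as a small transformation of the data matrix. The text does not give the noise distribution, so the uniform draw on a small interval (0, eps) and the function name `jaccard_plus` are hypothetical choices.

```r
# Jaccard+ sketch: replace zero Jaccard coefficients by small random
# non-null values. The range (0, eps) is a hypothetical choice.
jaccard_plus <- function(J, eps = 0.05) {
  zero <- J == 0
  J[zero] <- runif(sum(zero), min = .Machine$double.eps, max = eps)
  J
}

set.seed(1)
J <- matrix(c(1, 0.5, 0,
              0.5, 1, 0,
              0,   0, 1), nrow = 3, byrow = TRUE)
Jp <- jaccard_plus(J)
Jp   # former zeros are now small positive values; other entries are unchanged
```

Note that drawing each zero entry independently makes the matrix slightly asymmetric; a symmetric variant would draw the noise for the upper triangle only and mirror it.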

4.3 Results 

As we saw in Section 2, which presents the SVC method coupled with our
geometric approach to cluster extraction, the method is unsuccessful if no
clear geometric separation of the data appears on the 2-D map of
correspondence analysis coordinates. Figure 5 shows plots for the different
combinations of data attributes and distances. These maps show that TM-TM
Levenshtein, TM-RD Jaccard and TM-RD Jaccard+ can produce interesting maps for
SVC application. We therefore applied the cluster extraction to each data
matrix to compare the efficiency of class retrieval. Figure 6 shows the
performance of the method. The Jaccard kernel gives the best results, with
good separation and extraction of classes. The variant introducing random
noise into the matrix is still successful, with 2 misclassified items out of
98.

We now adopt the best clustering setting obtained, namely a term-radical
matrix (language model) and the Jaccard-Radial base kernel. To study
scalability, we increase the number of terms and radicals taken into account
in the Jaccard distance computation. Independently of cluster purity (class
homogeneity), the features (radicals) are what guarantees a good separation
between similar terms. Recall that a support vector machine is a non-linear
method that is efficient only if the data are separable. Hence the role of the
features is to provide similarity clues between terms, the role of the Jaccard
index is to capture similarity, the role of the 2-D component analysis is to
capture the main features that separate the data, and the role of support
vector clustering is to capture the bounds of clusters thanks to their
geometric separation. Figure 7 shows that too many features do not separate
the data (the attribute DName changes for each of the four data sets).

Figure 5: Each display depends first on the features describing the linguistic
phrases (TM, or terms), i.e. the terms themselves (TM-TM), radicals (TM-RD),
bigrams (TM-BG) or trigrams (TM-TG), and secondly on the kernel used for
clustering: Levenshtein, Jaccard, or Jaccard with artificial noise. Displays
represent data classes in green, red, blue and yellow on 2-D maps of the COA
components.

Figure 6: Clustering with SVC-geometric hashing (Nu = 0.9, G = 13, q chosen
at best between 1 and 10000).

Conversely, too few features yield too few clusters. A medium-sized set of
features can lead to a good number of clusters. In our case, 38 features
describing the structure of terms induce 15 clusters that are easily
distinguishable visually. Recall that the data set is made of 1893 terms
describing 6 stages of the sporulation process, as mentioned in Section 4.1.
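
The chain of roles recalled above (features, Jaccard index, radial kernel, 2-D
component analysis) can be sketched as follows. The sketch assumes that the
Jaccard-Radial base kernel has the form exp(-q * d^2), with d the Jaccard
distance, and it uses classical multidimensional scaling (stats::cmdscale) as
a stand-in for the correspondence analysis used in this paper. Both choices,
the function name jrb_kernel and the toy matrix are assumptions, not the svcR
implementation.

```r
# Toy binary term-by-feature matrix (rows: terms, columns: radicals).
M <- rbind(
  t1 = c(1, 1, 0, 0),
  t2 = c(1, 1, 0, 0),
  t3 = c(0, 0, 1, 1),
  t4 = c(0, 1, 1, 1)
)

# Jaccard distance between rows: stats::dist with method "binary".
D <- as.matrix(dist(M, method = "binary"))

# Assumed kernel form: radial function of the Jaccard distance, width q.
jrb_kernel <- function(D, q = 1) exp(-q * D^2)
K <- jrb_kernel(D, q = 2)

# 2-D coordinates for plotting and geometric cluster extraction
# (classical MDS here, in place of correspondence analysis).
xy <- cmdscale(D, k = 2)

print(round(K, 3))
print(round(xy, 3))
```

Terms with identical feature sets (t1 and t2 here) get a kernel value of 1 and
coincide on the map, which is the situation in which additional features are
needed to separate similar terms.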

Many terms belong to several stages (in the sense of classes). Even a string
token typical of one class can belong to different stages. This is mainly
because biological stages are defined from microscopy studies, hence from
visual patterns: a co-occurrence of patterns can be typical of a stage, while
each pattern taken individually occurs during several stages, such as mother
cell and compartmentalization (beginning at stage 2 and remaining at stages 3,
4, 5, 6), engulfment (beginning at stage 3 and remaining at stages 4, 5, 6), or
septum (beginning at stage 1 and remaining at stages 2, 3, 4). This
cross-membership makes it hard to map a specific term to a unique class.

In our results (Figure 7), obtaining more clusters (15) than classes (6) means
that terms can be misclassified (green points), but it also yields a variety of
specific clusters which we expect to capture pattern associations usable to
define the rules of an automaton. For instance, among the 15 clusters, one
contains the following members:

compartmentalized activation, compartment-specific activation

We can imagine a rule associated with stages 2, 3, 4 and 5: "compartment
activation". Another cluster contains the following members:

slow postseptational, prespore-specific SpoIIIE synthesis, endospore coat,
endospore coat assembly, endospore coat component, forespore coat, from the endospore coat,
cortex and/or coat assembly, spore coat and cortex.
Figure 7: Clustering with SVC-geometric hashing (Nu = 1, G = 30, q = 2000,
JRB+). Each column corresponds to a different number of terms and features;
data set sizes increase from left to right. Below each map is the number of
clusters extracted with svcR. Yellow marks clusters, red marks data points
belonging to the major class of their cluster, and green marks data points that
do not belong to the major class of their cluster.

We can imagine a rule associated with stages 3, 4, 5 and 6: "endospore coat",
"coat cortex", "cortex coat". Among these clusters of terms, coat, cort,
prespore, endospore, postseptational and forespore are in the set of features.
In this process of lexical rule definition the user plays an important role,
since a cluster does not give information that can be exported directly as an
automatically defined rule. It is the visualization of clusters by an expert
that leads to the identification of pattern associations to include in lexical
rules, all the more so because the elements taken into account are features,
and knowledge about features is required to state that these rules apply to a
set of classes (biological stages). The methodology shows, although this is
not a discovery, that clustering mixes components of different categories.
Nevertheless it can be efficient for identifying relevant features to use as
lexical patterns when building rules for information extraction, in our case
extraction of a biomolecular process described linguistically and formally by
several stages (i.e. a scenario in the domain of biology).

5 Comparison with other techniques 

In this section we discuss the behavior of competing clustering methods,
existing kernels, and the interpretation of the clustering capacity of SVC.
The simple, general R utility function below takes the output of the R
clustering functions used (k-means, svcR, hierarchical) and exploits a property
of the data, namely that the class number is inserted in each term (as in
"4 coat protein", meaning that "coat protein" belongs to class 4). Using the
grep function it is then very easy to find the distribution of classes over
clusters:

TabEval <- function(Dat) {
    # Dat: named vector of cluster labels (1..K); each name is a term
    # prefixed with its class number, e.g. "4 coat protein".
    # Rows 1..K of M hold one cluster each; the last row is the whole data set.
    # Columns 1-6: percentage per class; column 7: unused (0); column 8: size.
    M <- matrix(0, nrow = max(Dat) + 1, ncol = 8)
    for (k in 1:max(Dat)) {
        Stat <- c()
        Size <- length(Dat[Dat == k])
        for (i in 1:6) {
            # grep matches the class digit anywhere in the term name
            GR <- grep(i, names(Dat[Dat == k]))
            Stat <- c(Stat, 100 * length(GR) / Size)
        }
        Stat <- c(Stat, 0, Size)
        M[k, ] <- Stat
    }
    # Baseline: class distribution over the whole data set
    Stat <- c()
    for (i in 1:6) {
        Size <- length(Dat)
        GR <- grep(i, names(Dat))
        Stat <- c(Stat, 100 * length(GR) / Size)
    }
    Stat <- c(Stat, 0, Size)
    M[nrow(M), ] <- Stat
    print(M, digits = 1)
}

Figure 8: Clustering with SVC-geometric hashing (left), hierarchical
agglomerative clustering (centre), k-means (right). Data are 1893 terms with
6 classes and 37 features, using a Jaccard-radial base kernel.
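
As a usage illustration, the sketch below computes the same kind of
class-by-cluster percentages on a toy input with base R's table and prop.table.
The cluster vector, the term names and the extraction of the class number from
the leading digit are assumptions made for this example.

```r
# Toy clustering output: names carry the class number as a prefix,
# values are cluster labels (as returned e.g. by kmeans()$cluster).
Dat <- c("1 septum" = 1, "1 polar septum" = 1, "2 mother cell" = 1,
         "4 coat protein" = 2, "4 inner coat" = 2, "5 cortex" = 2)

# Class of each term, read from the leading digits of its name.
cls <- as.integer(sub("^([0-9]+) .*$", "\\1", names(Dat)))

# Rows: clusters, columns: classes, entries: percent of the cluster.
pct <- 100 * prop.table(table(cluster = Dat, class = cls), margin = 1)

# Number of terms per cluster.
sizes <- table(Dat)

print(round(pct, 1))
print(sizes)
```

Reading the class from an anchored prefix avoids counting a digit that happens
to occur elsewhere in a term, which a plain grep on the digit would do.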

5.1 Classical clustering 

Algorithms such as k-means (KM) and hierarchical clustering (HC) are widespread
knowledge-poor techniques that use metrics to find clusters automatically in
any kind of data. Figure 8 shows graphically how such clusters can be
represented. For svcR and KM, the 2-dimensional coordinates come from component
analysis. On the KM map only centroids represent clusters (plotted as stars).
HC (Figure 8, centre) displays a dendrogram whose branches stand for clustered
points and which must be cut at some level of the tree to obtain clusters. In
R, we used the kmeans function from the stats package [40] and the hclusterpar
function from the amap package [29].
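
For readers who want to reproduce this kind of comparison, a minimal sketch
with base R follows. It uses stats::kmeans and, as a stand-in for
amap::hclusterpar, stats::hclust followed by cutree to cut the dendrogram at a
given number of clusters. The toy coordinates, the seed, the linkage method and
the choice of 3 clusters are assumptions made for the example.

```r
set.seed(1)

# Toy 2-D coordinates: three well-separated groups of 20 points each.
xy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

# k-means: clusters are summarized by their centroids.
# Several random starts reduce the risk of a poor local optimum.
km <- kmeans(xy, centers = 3, nstart = 50)

# Hierarchical clustering: build the dendrogram, then cut it into 3 clusters.
hc <- hclust(dist(xy), method = "average")
hc_labels <- cutree(hc, k = 3)

# Cross-tabulate the two partitions to compare them.
agreement <- table(kmeans = km$cluster, hclust = hc_labels)
print(agreement)
```

The cross-tabulation plays the same role as the class distribution tables: each
row shows how one partition spreads over the groups of the other.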

Since the data contain 6 classes and svcR with the JRB kernel extracts 17
clusters, we set the number of clusters to 30 for both the KM and HC functions.
Figure 9 shows the content of clusters and the class distribution for each
approach (hierarchical, k-means and SVC). The right column of each result set
gives the size of each cluster. The last line gives the distribution over
classes in the whole data set as a baseline (12% of terms belong to class 1,
19% to class 2, 20% to class 3, 20% to class 4, 16% to class 5, 12% to class 6,
and the size of the set is 1893 terms). First, we observe that the distribution
of cluster sizes is similar between HC and svcR. Secondly, looking at the
over-representation of classes within clusters, HC and KM do not achieve a
better discrimination of terms across the 6 classes, even though some
individual clusters are better. Language ambiguities seem to be a real
bottleneck for all methods when a Jaccard-Radial kernel is used. But what
happens when string kernels are used?

5.2 String kernels 

[28] and [30] promoted string kernels to extract semantic knowledge from texts.
String kernels compute the similarity between two strings by matching their
common substrings. A standard string kernel is the constant one (SK-constant),
which assesses similarity even if characters match in any order, and returns a
higher value when order is respected and the matched substrings are longer.
Exact matching of n characters is called the spectrum kernel (SK-spectrum)
[41]. For instance, for a string of 29 characters compared with itself,
SK-constant returns 3165 and SK-spectrum returns 430.

If we pick two terms from our biology data set, SK-constant("inner coat",
"in the mother cell") = 22 and SK-spectrum("inner coat", "in the mother cell")
= 15; another pair gives SK-constant("inner coat", "initiation of sporulation")
= 27 and SK-spectrum("inner coat", "initiation of sporulation") = 24. According
to the string kernels the two pairs differ little, although the terms of the
first pair come from the same class (class 2) whereas the second pair compares
terms from different classes (class 1 and class 2). We built a kernel matrix
with each string kernel and performed cluster labelling with this similarity
information. The result is shown in Figure 10.

Even though SK-constant shows some ability to isolate clusters, one large
cluster contains 1600 items, i.e. 85% of the data. Such a kernel nevertheless
remains promising, perhaps if it includes more lexical knowledge.
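
To make the notion of spectrum kernel concrete, here is a minimal base-R sketch
that counts the substrings of length n shared by two strings. The function
names and the choice n = 3 are assumptions; the sketch is unnormalized and is
not the implementation used for the figures reported above, so it is not
expected to reproduce those values.

```r
# Counts of all substrings of length n in string s (requires nchar(s) >= n).
ngram_counts <- function(s, n) {
  table(substring(s, 1:(nchar(s) - n + 1), n:nchar(s)))
}

# Spectrum kernel: sum over shared n-grams of the product of their counts.
spectrum_kernel <- function(x, y, n = 3) {
  if (nchar(x) < n || nchar(y) < n) return(0)
  cx <- ngram_counts(x, n)
  cy <- ngram_counts(y, n)
  shared <- intersect(names(cx), names(cy))
  sum(as.numeric(cx[shared]) * as.numeric(cy[shared]))
}

k_same <- spectrum_kernel("inner coat", "in the mother cell")
k_diff <- spectrum_kernel("inner coat", "initiation of sporulation")
print(c(same_class = k_same, different_class = k_diff))
```

Because only exact n-character matches count, two terms sharing a short prefix
such as "in" contribute nothing unless the overlap reaches length n.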

5.3 Clusterability of a SVC model 

Section 2.1 presents a general framework for kernel methods. That framework
makes no assumption about clustering; it is rather oriented towards
classification. Nevertheless, SVC is not a new technique in itself. It has
been seen as one-cluster discovery, since a ball in the dual space is targeted.
Hence it has long been described in detail as a one-class approach applied to
novelty detection, where information deviates from a block of well-known
information. In the R project, the kernlab package [22], [21] implements the
novelty detection task. Run on our data set, the one-class kernel method
returns a model with 394 support vectors, for nu = 0.2 and a cross-validation
value of 0.205. Our observation is that SVC performs cluster extraction (or
labeling) well from a 2-dimensional map, provided that clusters exist, which
means that the data ought to be separable in the 2-D map. Separability can be
obtained by composing a metric with a radial basis function over all dimensions
of the input matrix. A possible explanation for the ability of SVC to identify
clusters relates to the problem of flattening the skin of an orange onto a
tabletop. There, projection is a procedure that transforms locations and
features from the three-dimensional surface onto two-dimensional paper in a
defined and consistent way. The result is some slight bulges and a lot of gaps.
The transformation of map information from a sphere to a flat sheet can be
accomplished in many ways, but no single projection can accurately portray
area, shape, scale and direction. SVC clustering originates in this capacity
of projections to distort.

Figure 9: Class distribution over clusters resulting from HC (left 4 columns),
KM (centre 4 columns) and svcR (right 4 columns). Only the first three of the
6 classes are displayed, i.e. C1 to C3. These classes were built by hand and
each term is tagged in the data matrix with one of them. After clustering, we
collect the class membership of the terms of a given cluster. In the table,
each line is the distribution of a given cluster (clusters differ between HC,
KM and svcR) and shows the weight in percent of each class in the cluster. The
last line is a baseline showing what the weight of a class would be if all
terms formed a single cluster; for instance class 1 represents 12% of the set
of terms. Each line also displays (column #) the number of terms belonging to
the cluster.


6 Conclusion 

We developed, improved and applied a density- and kernel-based method called
support vector clustering (SVC), which we implemented as an R package (svcR).
The package is available from the CRAN R project server
(http://cran.r-project.org/, see Software, Packages; svcR version 1.4) and can
be downloaded from the R graphical user interface (required R libraries:
quadprog, ade4 and spdep). First, we showed that mapping points in the data
space to a grid and using the sphere radius from the attribute space together
with a k-nearest-neighbor approach reduces the time needed for cluster
labeling. In this sense, SVC can be seen as an efficient cluster extraction
method if clusters are separable in a 2-D map. Secondly, we found a
representation for term clustering using a mixed Jaccard-Radial base kernel and
showed its efficiency with SVC for term clustering in a natural language
processing task, lexical classification (i.e. ontology-oriented knowledge
acquisition). Work remains on the R implementation to integrate C functions for
matrix acquisition, so as to make the toolkit more scalable in data size.
Semantic and lexical-based kernels are promising for application in text mining
frameworks, yet how to select and integrate attributes for the description of
terms must still be understood. In future work we aim to investigate the
extraction of clusters over more than 2 dimensions and to test robustness on
non-separable data.